Parrotpark: Why and How to self-host LLMs

Jonas Stettner | CorrelAid @ CDL

2025-05-07

Agenda

  1. What is Parrotpark?
  2. Why self-hosting?
  3. How to self-host?
  4. What to self-host?
  5. Parrotpark Architecture
  6. Evaluation
  7. Demonstration
  8. Discussion

What is Parrotpark? 🦜

  • Cooperation with D64 in the context of the project “Code of Conduct: Democratic AI”
  • Experimental infrastructure project for self-hosting LLMs and accompanying applications
  • Targeted at work in civil society organizations
  • The project has concluded and ended with an evaluation

Why self-hosting?

  • Digital Sovereignty
  • Dependence on providers, lack of transparency, and little control over:
    • Data processing (GDPR)
    • Resource consumption
    • Properties and training of the models
    • Model and tool usage/configuration; e.g. web search (🗲 GUI apps such as GPT Builder)

How to self-host: LLM hosting options

  • LLM inference:
    • Azure OpenAI on EU servers - ✅ GDPR
    • Open models - ✅ More transparent model characteristics
      • API services hosted in the EU
      • Dedicated GPU server - ✅ Fully transparent resource consumption (inference only)

Dedicated vs API: Costs for EU Provider Scaleway

  • Claude 4 Opus on OpenRouter: $15/M input tokens; $75/M output tokens

Dedicated vs API: Costs for EU Provider Scaleway

  • How much VRAM can we afford? An NVIDIA L4 with 24 GB limits both model choice and context window size
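A back-of-envelope sketch of that constraint. The parameter count, layer/head counts and context length below are illustrative assumptions, not the specs of any particular model:

```python
# Back-of-envelope VRAM estimate: does a quantised ~24B model plus its
# KV cache fit on a 24 GB L4? All architecture numbers are assumptions.

def vram_gb(params_b, bits, layers, kv_heads, head_dim, ctx, kv_bytes=2):
    weights = params_b * 1e9 * bits / 8                         # quantised weights
    kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V, fp16
    kv_cache = kv_per_token * ctx
    return (weights + kv_cache) / 1e9

# Hypothetical 24B model, 4-bit weights, 32k context
need = vram_gb(params_b=24, bits=4, layers=40, kv_heads=8,
               head_dim=128, ctx=32_000)
print(f"~{need:.1f} GB of 24 GB")   # weights ~12 GB + KV cache ~5 GB
```

Halving the context roughly halves the KV-cache term, which is why context window size is a budget knob of its own.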

Dedicated vs API: Example Pricing Calculation

  • Scaleway allows automated GPU instance creation (unlike Hetzner), so we deploy only during working hours
    • \(\text{Cost} = \text{€}0.75/\text{h} \times (10\,\text{h/day} \times 5\,\text{days} \times 4\,\text{weeks}) = \text{€}150\)
    • Including tax (19%): €178.50
  • Compared to Mistral Small 3.2 24B via OpenRouter at $0.05/M input and $0.10/M output tokens (assuming a 50/50 input/output split):
    • €178.50 = $210.27 (at €1 = $1.178)
    • \(\frac{\text{\$}105.14}{\text{\$}0.05/\text{M}} + \frac{\text{\$}105.14}{\text{\$}0.10/\text{M}} \approx 3{,}154\,\text{M tokens}\)
    • Per working day: \(\frac{3{,}154\,\text{M}}{20} \approx 158\,\text{M tokens/day}\)
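The break-even calculation above as a small script; the rates and prices are the ones from this slide, so adjust them for your own setup:

```python
# Break-even sketch: how many OpenRouter tokens does one month of a
# Scaleway L4 (working hours only) buy? Rates taken from the slide.

HOURLY_RATE_EUR = 0.75          # Scaleway L4 instance, per hour
HOURS_PER_MONTH = 10 * 5 * 4    # 10 h/day, 5 days/week, 4 weeks
VAT = 0.19
EUR_USD = 1.178

PRICE_IN = 0.05 / 1e6           # $ per input token (Mistral Small 3.2 24B)
PRICE_OUT = 0.10 / 1e6          # $ per output token

gpu_eur = HOURLY_RATE_EUR * HOURS_PER_MONTH * (1 + VAT)   # €178.50
budget_usd = gpu_eur * EUR_USD                            # ≈ $210.27

# 50/50 input/output split: spend half the budget on each side
half = budget_usd / 2
tokens = half / PRICE_IN + half / PRICE_OUT               # ≈ 3,154M tokens
per_day = tokens / 20                                     # ≈ 158M/day

print(f"GPU cost:   €{gpu_eur:.2f} (${budget_usd:.2f})")
print(f"Break-even: {tokens / 1e6:,.0f}M tokens, {per_day / 1e6:.0f}M/day")
```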

Dedicated vs API: Why Dedicated?

  • Maximum control and transparency
  • More predictable/fixed costs
  • Other services can be co-located on the GPU server
  • Exact metrics on hardware and inference server level

What to self-host? - Which LLM?

  • Model size, context and concurrency:
    • Infrastructure choice limits model options
    • Pre-quantised models on Hugging Face
  • Which language and task is the model used for?
    • “When a measure becomes a target, it ceases to be a good measure.” (Goodhart’s Law)
    • For automated tests: LLM as a judge
  • Finetuning vs. Prompt Engineering

What to self-host? - Other required services

  • Inference Server: Ollama, vLLM, Hugging Face TGI etc.
  • LiteLLM: API gateway for access control and routing
  • Chat Interface: LibreChat, Open WebUI
  • Databases (Vector, Application) and File Storage
  • Metrics
  • Auth
  • MCP / tool integrations, e.g. web search
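As an illustration of how these pieces connect, a minimal LiteLLM proxy configuration sketch, assuming an OpenAI-compatible inference server (e.g. vLLM) on localhost; model names, ports and keys are placeholders:

```yaml
# LiteLLM proxy config sketch -- values are placeholders
model_list:
  - model_name: mistral-small          # name exposed to chat UI / clients
    litellm_params:
      model: openai/mistral-small      # OpenAI-compatible backend (e.g. vLLM)
      api_base: http://localhost:8000/v1
      api_key: "none"                  # local backend needs no real key

general_settings:
  master_key: sk-change-me             # gates access to the proxy itself
```

LibreChat or Open WebUI would then point at the LiteLLM endpoint instead of the inference server directly, which is what gives you central control over keys and usage.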

Parrotpark Architecture: High Level Overview

[Architecture diagram: high-level overview]

Parrotpark Architecture: Misc. Details

  • Implementation as Infrastructure as Code (Terraform and Ansible)
    • Nested (Entrance Server periodically runs IaC)
  • Metrics scraped with Telegraf and sent to an external TimescaleDB instance, connected to Metabase

Evaluation

  • Time window: June 17th to June 27th (9 working days)
  • Scraped Metrics: http://mtbs.correlaid.org/public/dashboard/6032e4e9-e87a-49d7-bd67-f0d92552cc1c
  • User Survey

Evaluation: Tokens and Pricing

  • Total processed tokens: 329,503 input / 103,083 output
  • ❌ An API service for the same model would have cost far less: ~$0.027 vs. roughly half a month of GPU time (€178.50 / 2 ≈ €89)
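A quick sanity check of that comparison, using the measured token counts and the OpenRouter prices from the pricing slide:

```python
# What the measured token volume would have cost via OpenRouter
# (Mistral Small 3.2 24B at $0.05/M input, $0.10/M output).

input_tokens = 329_503
output_tokens = 103_083

api_cost_usd = input_tokens / 1e6 * 0.05 + output_tokens / 1e6 * 0.10
gpu_cost_eur = 178.50 / 2       # roughly half a month of GPU time

print(f"API: ${api_cost_usd:.3f}")    # ~$0.027
print(f"GPU: €{gpu_cost_eur:.2f}")    # €89.25
```

At this usage level the dedicated server is three orders of magnitude more expensive per token, so the case for it rests on the control and transparency arguments, not on cost.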

Demonstration